Comments for MEDB 5502, Week 02

Topics to be covered

  • What you will learn
    • Three models for predicting a continuous outcome
    • Analysis of variance table
    • Variable selection
    • Stepwise regression
    • Residual analysis
    • Collinearity
    • Mediation

Three models

  • All use a continuous dependent (outcome) variable
  • All include multiple independent variables
  • Multiple linear regression (Week 02)
    • All independent variables are continuous
  • Analysis of covariance (Week 03)
    • Mix of continuous and categorical independent variables
  • Multi-factor analysis of variance (Week 04)
    • All independent variables are categorical

Why three models?

  • Historical precedents
  • Different issues
    • Multicollinearity
    • Mediator variable
    • Risk adjustment
    • Moderator variable
    • Interactions

The general linear model

  • Single model that unites all three models.
  • Use of indicator variables for categorical data
  • Not the same as the generalIZED linear model

The dialog box for the general linear model

Break #1

  • What you have learned
    • Three models for predicting a continuous outcome
  • What’s coming next
    • Analysis of variance table

Data for multiple linear regression

  • Dependent variable: \(Y_1, Y_2,\ldots,Y_n\)
  • First independent variable: \(X_{11}, X_{12}, \ldots, X_{1n}\)
  • Second independent variable: \(X_{21}, X_{22}, \ldots, X_{2n}\)
  • kth independent variable: \(X_{k1}, X_{k2}, \ldots, X_{kn}\)

Matrix notation, 1 of 3

\[\begin{matrix}Y=& \begin{bmatrix} Y_1\\ Y_2\\ \vdots\\ Y_{j}\\ \vdots\\ Y_n \end{bmatrix} \end{matrix}\]

Matrix notation, 2 of 3

\[\begin{matrix}X=& \begin{bmatrix} X_{11} & X_{21} & \ldots & X_{i1} & \ldots & X_{k1}\\ X_{12} & X_{22} & \ldots & X_{i2} & \ldots & X_{k2}\\ \vdots & \vdots & & \vdots & & \vdots\\ X_{1j} & X_{2j} & \ldots & X_{ij} & \ldots & X_{kj}\\ \vdots & \vdots & & \vdots & & \vdots\\ X_{1n} & X_{2n} & \ldots & X_{in} & \ldots & X_{kn} \end{bmatrix} \end{matrix}\]

Matrix notation, 3 of 3

\[\begin{matrix}X=& \begin{bmatrix} 1 & X_{11} & X_{21} & \ldots & X_{i1} & \ldots & X_{k1}\\ 1 & X_{12} & X_{22} & \ldots & X_{i2} & \ldots & X_{k2}\\ \vdots & \vdots & \vdots & & \vdots & & \vdots\\ 1 & X_{1j} & X_{2j} & \ldots & X_{ij} & \ldots & X_{kj}\\ \vdots & \vdots & \vdots & & \vdots & & \vdots\\ 1 & X_{1n} & X_{2n} & \ldots & X_{in} & \ldots & X_{kn} \end{bmatrix} \end{matrix}\]
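
As a concrete sketch of the third form above, the design matrix with its leading column of ones can be assembled directly from the predictor lists. The numbers here are made up for illustration:

```python
# Sketch: building the design matrix X with a leading column of 1s
# (for the intercept), one row per subject. Toy data only.
x1 = [23.0, 26.5, 24.1, 25.8]   # first predictor, one value per subject
x2 = [36.2, 39.0, 37.7, 40.1]   # second predictor

# Row j of X is [1, X_1j, X_2j]
X = [[1.0, a, b] for a, b in zip(x1, x2)]
```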

Why use multiple linear regression?

  • Two competing purposes
    • Mechanisms
      • What variables have an impact on your outcome?
    • Prediction
      • What outcome will you see on tomorrow’s patient?
      • “It is difficult to make predictions, especially about the future.” Niels Bohr

Multiple linear regression model

  • \(Y_i=\beta_0+\beta_1 X_{1i}+\beta_2 X_{2i}+\ldots+\beta_k X_{ki}+\epsilon_i\)
    • Assumptions: \(\epsilon_i\) are
      • independent,
      • same variance,
      • normally distributed.
    • Independent variables must be on interval or ratio scale
      • Nominal/ordinal scales require some care

Estimates for the multiple linear regression model

  • Use least squares to estimate \(\hat\beta_0,\ \hat\beta_1,\ \hat\beta_2,\ \ldots,\ \hat\beta_k\)
    • \(\hat Y_i=\hat\beta_0+\hat\beta_1 X_{1i}+\hat\beta_2 X_{2i}+\ldots+\hat\beta_k X_{ki}\)
    • \(e_i=Y_i-\hat Y_i\)
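
A minimal pure-Python sketch of the least-squares computation via the normal equations \((X'X)\hat\beta=X'Y\); in practice you would let SPSS or a linear-algebra library do this. The toy data are constructed so that Y is exactly \(1+2X_1+0.5X_2\), so the estimates recover those coefficients:

```python
# Sketch: least-squares estimates from the normal equations
# (X'X) beta-hat = X'Y, solved by Gauss-Jordan elimination.
# Toy data constructed so that Y = 1 + 2*X1 + 0.5*X2 exactly.

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 5.0]
X = [[1.0, a, b] for a, b in zip(x1, x2)]          # design matrix
Y = [1 + 2 * a + 0.5 * b for a, b in zip(x1, x2)]  # outcome

p = len(X[0])
XtX = [[sum(row[i] * row[j] for row in X) for j in range(p)]
       for i in range(p)]
XtY = [sum(row[i] * y for row, y in zip(X, Y)) for i in range(p)]
beta = solve(XtX, XtY)          # beta-hat: close to [1.0, 2.0, 0.5]

resid = [y - sum(b * x for b, x in zip(beta, row))  # e_i = Y_i - Yhat_i
         for y, row in zip(Y, X)]
```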

Sum of squares

  • “Bad” regression results.
    • All \(\hat Y_i\) close to each other
    • Few \(\hat Y_i\) close to \(Y_i\)
  • “Good” regression results.
    • The \(\hat Y_i\) are spread out
    • Many \(\hat Y_i\) close to \(Y_i\)

Analysis of variance table for multiple linear regression, 1 of 3

  • SSR or \(SS_{regression}=\Sigma(\hat Y_i-\bar Y)^2\)
  • SSE or \(SS_{error}\) or \(SS_{residual}=\Sigma(Y_i-\hat Y_i)^2=\Sigma\ e_i^2\)
  • SST or \(SS_{total}=\Sigma(Y_i-\bar Y)^2\)
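
The three sums of squares can be sketched with a toy one-predictor fit (made-up data); for a least-squares fit with an intercept they satisfy SST = SSR + SSE:

```python
# Sketch: the three sums of squares for a one-predictor least-squares
# fit, using the closed-form slope and intercept. Toy data only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
      / sum((xi - xbar) ** 2 for xi in x))
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * xi for xi in x]

SSR = sum((yh - ybar) ** 2 for yh in yhat)            # regression SS
SSE = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))  # residual SS
SST = sum((yi - ybar) ** 2 for yi in y)               # total SS
# For a least-squares fit with an intercept, SST = SSR + SSE.
```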

Analysis of variance table for multiple linear regression, 2 of 3

  • \(df_{regression}=k\)
  • \(df_{error}=n-k-1\)
  • \(df_{total}=n-1\)
    • \(MS=SS/df\)

Analysis of variance table for multiple linear regression, 3 of 3

  • \(F=MSR/MSE\)
  • This tests the hypotheses
    • \(H_0:\ \beta_1=\beta_2=\ldots=\beta_k=0\)
    • \(H_1:\ \beta_j \ne 0\) for at least one j
    • Accept \(H_0\) if F is close to 1
    • Reject \(H_0\) if F is much larger than 1
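
Putting the pieces together with the fat-data sums of squares quoted elsewhere in these slides (assuming n = 252 subjects and k = 3 predictors):

```python
# Sketch: the overall F statistic MSR/MSE, using the fat-data sums of
# squares quoted in these slides (n = 252, k = 3 predictors assumed).
n, k = 252, 3
SSR, SST = 10548.480, 15079.017
SSE = SST - SSR                 # 4530.537

MSR = SSR / k                   # mean square regression, df = k
MSE = SSE / (n - k - 1)         # mean square error, df = n - k - 1
F = MSR / MSE                   # far above 1, so reject H0
```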

Example using fat data

R-squared

  • \(R^2=SSR/SST\) or \(1-SSE/SST\)
    • Proportion of explained variation

Example using the fat data

  • 10,548.480/15,079.017 = 0.70

Adjusted \(R^2\)

  • \(1-\frac{MSE}{MST}\) or
  • \(1-\frac{SSE}{SST}\frac{(n-1)}{(n-k-1)}\)
    • Field textbook suggests a more complex formula
    • Penalizes for model complexity (but not enough)
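
Using the same fat-data sums of squares (n = 252, k = 3 assumed), both versions of R-squared are one-liners:

```python
# Sketch: R-squared and adjusted R-squared from the fat-data sums of
# squares, matching the 0.70 quoted above. n = 252, k = 3 assumed.
n, k = 252, 3
SSR, SST = 10548.480, 15079.017
SSE = SST - SSR

R2 = SSR / SST                                     # about 0.70
adj_R2 = 1 - (SSE / SST) * (n - 1) / (n - k - 1)   # slightly smaller
```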

Live demo, ANOVA and R-squared

Break #2

  • What you have learned
    • Analysis of variance table
  • What’s coming next
    • Variable selection

Avoid needlessly complex regression models

  • “Everything should be made as simple as possible, but not simpler” Albert Einstein (?)
  • “If you have two competing ideas to explain the same phenomenon, you should prefer the simpler one.” Occam’s Razor
  • “The parsimony principle for a statistical model states that: a simpler model with fewer parameters is favored over more complex models with more parameters, provided the models fit the data similarly well.” - ClubVITA

Choosing independent variables is a balancing act

  • Too many is bad
    • Expensive
    • May not replicate well
  • Too few is bad
    • May not explain much of the variation
    • May miss important features
  • Finding the right balance depends on your goals
    • Mechanistic approach versus prediction

Counterpoint on complexity

  • Machine learning algorithms
  • Risk adjustment

Rule of 15

  • Developed by Frank Harrell in a different context
    • Ratio of observations to independent variables
      • n/k > 15
      • Some use 10 instead of 15
    • Smaller ratios imply poor replicability
  • Not a replacement for a power calculation
  • Some researchers have argued against this rule

Comparing models with k and k-1 predictors

  • \(Y_i=\beta_0+\beta_1 X_{1i}+\beta_2 X_{2i}+...+\beta_k X_{ki}+\epsilon_i\)
    • Test \(H_0:\ \beta_j=0\) versus \(H_1:\ \beta_j \ne 0\)
      • Remaining \(\beta\)’s could be anything
    • Use least squares to estimate \(\hat\beta_0,\ \hat\beta_1,\ \hat\beta_2,\ ...,\ \hat\beta_k\)
    • Compute \(T=\frac{\hat\beta_j}{se(\hat\beta_j)}\)
    • Accept \(H_0\) if T is close to zero
    • Reject \(H_0\) if T is large negative or large positive

Testing the impact of chest circumference

Change in R-squared

  • \(Partial\ R^2=0.700-0.693=0.007\)

Partial F test

  • Test \(H_0:\ \beta_{k+1}=\beta_{k+2}=...=\beta_{k+m}=0\)
  • Calculate ANOVA table for k+m independent variables
    • Note various SS and MS values for the “full” model
  • Recalculate ANOVA table for only k independent variables
    • Note various SS and MS values for the “reduced” model.
  • Calculate \(F=\frac{(SSR_{full}-SSR_{reduced})/m}{SSE_{full}/(n-(k+m+1))}\)

Testing the impact of hips and chest, 1 of 2

Testing the impact of hips and chest, 2 of 2

  • \(F=\frac{(SSR_{full}-SSR_{reduced})/m}{SSE_{full}/(n-(k+m+1))}\)
  • \(F=\frac{(10,548.480-9,984.086)/2}{4,530.537/(252-(1+2+1))}\)
  • \(F=\frac{282.2}{18.26829}=15.4\)
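
The same computation, reproduced step by step in code (numbers taken from the slide above; the reduced model keeps one predictor, k = 1, and the full model adds m = 2 more):

```python
# Sketch: the partial F test arithmetic from the slide above.
n, k, m = 252, 1, 2
SSR_full, SSR_reduced = 10548.480, 9984.086
SSE_full = 4530.537

F = ((SSR_full - SSR_reduced) / m) / (SSE_full / (n - (k + m + 1)))
# F is about 15.4, far above 1: the two extra predictors matter
```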

Live demo, partial F test

Break #3

  • What you have learned
    • Variable selection
  • What’s coming next
    • Stepwise regression

What is a good model?

  • “Everything but the kitchen sink” model
  • Stepwise models
  • Hierarchical models
  • “Maximum adjusted R-squared” model
  • “Out of sample error” model
  • “Use your brain” model
  • Some combination of the above?

“Everything but the kitchen sink” model

  • Include anything that seems remotely plausible
  • Advantages
    • Simple computationally
  • Disadvantages
    • Does not identify mechanisms
    • Can be expensive
    • Fails if k is very large

Kitchen sink model for body fat

Stepwise model, 1 of 2

  • Forward selection
  • Backward elimination
  • Composite of both

Stepwise model, 2 of 2

  • Advantages
    • Automated in SPSS and other packages
  • Disadvantages
    • Does not incorporate medical knowledge
    • Does not control Type I error rate
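
A toy sketch of forward selection: start from the intercept-only model, add whichever predictor most improves R-squared, and stop when the improvement falls below a cutoff. The 0.01 cutoff is hypothetical; SPSS and other packages use entry/removal p-values instead:

```python
# Sketch of forward selection with a hypothetical R-squared cutoff.
# Real stepwise routines (e.g. in SPSS) use entry/removal p-values.

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def r_squared(cols, Y):
    """R-squared for a least-squares fit of Y on the given predictors."""
    n = len(Y)
    X = [[1.0] + [c[i] for c in cols] for i in range(n)]
    p = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(p)]
           for i in range(p)]
    XtY = [sum(row[i] * y for row, y in zip(X, Y)) for i in range(p)]
    beta = solve(XtX, XtY)
    ybar = sum(Y) / n
    yhat = [sum(b * x for b, x in zip(beta, row)) for row in X]
    sse = sum((y - yh) ** 2 for y, yh in zip(Y, yhat))
    sst = sum((y - ybar) ** 2 for y in Y)
    return 1 - sse / sst

def forward_select(predictors, Y, min_gain=0.01):
    """Greedily add the predictor that most improves R-squared."""
    chosen, best = [], 0.0
    while len(chosen) < len(predictors):
        gains = [(r_squared([predictors[j] for j in chosen]
                            + [predictors[i]], Y) - best, i)
                 for i in range(len(predictors)) if i not in chosen]
        gain, i = max(gains)
        if gain < min_gain:
            break
        chosen.append(i)
        best += gain
    return chosen

x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0]   # correlated with x1, adds nothing
Y = [2.0 * v + 1.0 for v in x1]        # outcome driven entirely by x1
picked = forward_select([x1, x2], Y)   # x1 enters first; x2 never does
```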

Hierarchical models

  • Sometimes variables fall into natural groups.
    • Demographic features
    • Patient frailty
    • Environmental factors
    • Current treatments
  • Enter data in blocks

“Maximum adjusted R-squared” model

  • Fit all possible models (\(=2^k\))
    • Select model with largest adjusted R-squared
    • Alternative criteria
      • Mallows Cp
      • AIC, AICc
      • LASSO

“Out of sample error” model

  • Split data into training/test sets
    • Use training data to build the model
    • Use test dataset to evaluate fit
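
A sketch of the idea with a deterministic split (real workflows would split at random): the first half trains a one-predictor model and the second half measures its out-of-sample error. Toy data only:

```python
# Sketch: out-of-sample error -- fit on the first half of the data,
# evaluate mean squared prediction error on the second half.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.2, 3.9, 6.1, 8.0, 9.8, 12.1, 14.0, 16.2]

x_tr, y_tr = x[:4], y[:4]          # training set
x_te, y_te = x[4:], y[4:]          # test set

n = len(x_tr)
xbar, ybar = sum(x_tr) / n, sum(y_tr) / n
b1 = (sum((a - xbar) * (b - ybar) for a, b in zip(x_tr, y_tr))
      / sum((a - xbar) ** 2 for a in x_tr))
b0 = ybar - b1 * xbar              # fit uses training data only

mse_test = sum((b0 + b1 * a - b) ** 2
               for a, b in zip(x_te, y_te)) / len(x_te)
```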

“Use your brain” model

  • Previous research
  • Knowledge of medicine/science
  • Superior to “brainless” approaches
    • “Ginny!” said Mr. Weasley, flabbergasted. “Haven’t I taught you anything? What have I always told you? Never trust anything that can think for itself if you can’t see where it keeps its brain?” from Harry Potter and the Chamber of Secrets, J.K. Rowling.

Use a mixture

  • Use a mixture of science/medicine with automated approaches?
    • Story about an industrial process

Live demo, stepwise regression

Break #4

  • What you have learned
    • Stepwise regression
  • What’s coming next
    • Residual analysis

Using residuals to check for assumption violations

  • Non-normality
    • QQ plot
  • Lack of independence
    • Time sequence plot, Durbin-Watson statistic
      • Only for time-ordered data
  • Unequal variances
    • Scatterplot of residuals versus predicted values
  • Non-linearity
    • Scatterplot of residuals versus each independent variable
    • Scatterplot of residuals versus predicted values

Q-Q plot of residuals

Scatterplot of residuals and chest circumference

Scatterplot of residuals and abdomen circumference

Scatterplot of residuals and hip circumference

Scatterplot of residuals and predicted values

Live demo, residual plots

Break #5

  • What you have learned
    • Residual analysis
  • What’s coming next
    • Collinearity

What is collinearity?

  • Strong interrelationship among the independent variables
    • Also known as
      • multi-collinearity
      • near collinearity
      • ill-conditioning
  • Interrelationship could be just two variables
    • Also could be three or more interrelated variables

Examples of collinearity

  • Birthweight and gestational age predicting length of stay
  • Size of the home and size of the lot predicting sales price
  • Calories from fat, from protein, and from carbohydrates predicting weight gain

What problems are caused by collinearity?

  • Difficulty in variable selection
  • Loss of precision
    • wider confidence intervals
  • Loss of power
    • Need for larger sample sizes
  • Not a violation of assumptions
  • Not a problem if you are only interested in prediction

Fixing collinearity

  • Collect more data
  • Oversample “rare” corners
  • Prune your variables

Measures of collinearity

  • Correlation matrix
  • Tolerance
    • \(Tol_i=1-R_i^2\)
      • \(R_i^2\) for predicting \(i^{th}\) independent variable from remaining independent variables
  • Variance inflation factor
    • \(VIF_i=\frac{1}{Tol_i}\)
    • Increase in \(Var(\hat\beta_i)\) due to collinearity
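
A sketch for the two-predictor special case, where \(R_i^2\) is simply the squared correlation between the two predictors; with three or more predictors, \(R_i^2\) would come from regressing predictor i on all the others:

```python
# Sketch: tolerance and VIF for two predictors, where R_i^2 reduces
# to the squared correlation between them. Toy data only.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [1.1, 2.3, 2.9, 4.2, 4.8]     # tracks x1 closely

n = len(x1)
m1, m2 = sum(x1) / n, sum(x2) / n
sxy = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
sxx = sum((a - m1) ** 2 for a in x1)
syy = sum((b - m2) ** 2 for b in x2)
r2 = sxy * sxy / (sxx * syy)       # R_1^2 = R_2^2 here

tol = 1 - r2                       # tolerance near 0 signals trouble
vif = 1 / tol                      # VIF well above 1
```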

Collinearity statistics for the fat dataset

What is perfect collinearity?

  • Exact relationship among independent variables
  • Impossible to estimate regression coefficients
  • Examples
    • Measuring temperature in both Fahrenheit and Centigrade
    • Three percentages adding up to exactly 100%
  • Only solution: drop one or more variables
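
A sketch of the Fahrenheit/Centigrade example: because one predictor is an exact linear function of the other, \(X'X\) is singular and the regression coefficients cannot be estimated:

```python
# Sketch: perfect collinearity makes X'X singular, so the normal
# equations have no unique solution. Centigrade here is an exact
# linear function of Fahrenheit.
F = [32.0, 50.0, 68.0, 86.0]
C = [(f - 32.0) * 5.0 / 9.0 for f in F]   # exactly determined by F

X = [[1.0, f, c] for f, c in zip(F, C)]
p = 3
XtX = [[sum(row[i] * row[j] for row in X) for j in range(p)]
       for i in range(p)]

def det3(M):
    """Determinant of a 3x3 matrix by cofactor expansion."""
    return (M[0][0] * (M[1][1] * M[2][2] - M[1][2] * M[2][1])
            - M[0][1] * (M[1][0] * M[2][2] - M[1][2] * M[2][0])
            + M[0][2] * (M[1][0] * M[2][1] - M[1][1] * M[2][0]))

d = det3(XtX)   # zero (up to rounding): coefficients not estimable
```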

Live demo, multicollinearity

Break #6

  • What you have learned
    • Collinearity
  • What’s coming next
    • Mediation

What is mediation?

  • “A situation when the relationship between a predictor variable and an outcome variable can be explained by their relationship to a third variable (the mediator)”
    • Andy Field, Section 11.4

Why do you need to understand mediation?

  • Mostly irrelevant for prediction
  • Very important for understanding mechanisms
    • Is there a direct relationship?

An informal assessment of mediation

Live demo, mediation

Summary

  • What you have learned
    • Three models for predicting a continuous outcome
    • Analysis of variance table
    • Variable selection
    • Stepwise regression
    • Residual analysis
    • Collinearity
    • Mediation

Additional topics??